DOCUMENT RESUME

ED 464 934                                        TM 033 865

AUTHOR        Pommerich, Mary
TITLE         The Effect of Administration Mode on Test Performance and
              Score Precision, and Some Factors Contributing to Mode
              Differences.
PUB DATE      2002-04-00
NOTE          67p.; Paper presented at the Annual Meeting of the National
              Council on Measurement in Education (New Orleans, LA, April
              2-4, 2002).
PUB TYPE      Numerical/Quantitative Data (110) -- Reports - Research
              (143) -- Speeches/Meeting Papers (150)
EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   Achievement Tests; Adaptive Testing; *Computer Assisted
              Testing; Computer Literacy; *High School Students; High
              Schools; Scores; *Test Format; Test Results; Testing
              Problems
IDENTIFIERS   Item Parameters; *Paper and Pencil Tests

ABSTRACT

This paper considers differences in modes of test
administration, addressing three questions: (1) Do examinees respond to items
in the same way across administration modes and computer interface
variations? (2) What are some of the factors that can contribute to modal
effects? and (3) Can item parameters calibrated from paper and pencil
administrations be used for computer administrations? The questions were
examined using data from paper and pencil and computer administrations of a
fixed-form test in two different examinee samples. Several of the tests
studied were complex. An initial comparability study was performed in 1998
involving approximately 8,600 students, and in response to that study,
revisions were made to the computer interfaces. A follow-up study was
performed in 2000 with approximately 12,000 examinees. This paper examines
performance differences across paper and computer modes and across computer
interface variations in both studies. Results are summarized at the total
test level and for some individual items. Some factors that might have
contributed to mode differences or affected computer performance in general
are discussed. A small simulation study was also performed to examine the
effect of using item parameters calibrated from paper and pencil
administrations in a computer administration. Some items showed no
performance differences across administration modes, but other items did. A
variety of factors appeared to contribute to mode effects, and each item
seemed to have a unique set of circumstances. Changes to the computer
interface appeared to affect performance. Overall, results suggest that while
performance effects do occur across modes, they have a fairly small effect in
practice. Simulation results suggest that item parameters calibrated from
paper and pencil tests could probably be used initially in a computer
administration. (Contains 18 tables.) (SLD)



Reproductions supplied by EDRS are the best that can be made 
from the original document. 






The Effect of Administration Mode on Test Performance and Score Precision, 
and Some Factors Contributing to Mode Differences 

Mary Pommerich 
Defense Manpower Data Center 



Paper presented at the annual meeting of the National Council on Measurement in Education, 
New Orleans, LA, April 2002. 







Acknowledgements 



Many people contributed to the planning, design, and data collection for the studies presented in 
this paper, including members of the Support, Technological Applications and Research 
Department, the Elementary and Secondary School Programs Department, the Measurement 
Research Department, the Operations Department, the Educational Services Department, and the 
Statistical Research Department, all at ACT. 

Many people contributed to the design and development of the computer interfaces and tutorials 
used in the studies presented in this paper, including members of the Support, Technological 
Applications and Research Department, the Elementary and Secondary School Programs 
Department, the Placement Programs Department, the Systems Support Department, and the 
Educational Technology Center, all at ACT. 

Many people contributed to data analyses that were conducted for the studies presented in this 
paper, including members of the Measurement Research Department and the Support, 
Technological Applications and Research Department, all at ACT. 

In particular, thanks to Dean Colton and Han-wei Chen for identifying problematic records and 
cleaning the datasets discussed in this paper. Thanks to Jim Patterson, David Duer, Ann Gordon, 
and Beth Gehring for developing hypotheses presented in this paper. Thanks to Brad Hanson for 
supplying the results from the equating study presented in the paper. Finally, thanks to Dan 
Segall for his assistance on the simulation portion of the paper. 



The Effect of Administration Mode on Test Performance and Score Precision, 
and Some Factors Contributing to Mode Differences 

As testing moves from paper and pencil administration toward computerized 
administration, how to present complex tests on a computer screen becomes an important 
concern. Information that can be viewed in full on a two-page spread in a booklet cannot 
typically be presented on a single computer screen. In a dual-platform testing program with a 
complex test, taking certain items in one mode or the other could possibly advantage some 
examinees. Even in a computer-only platform, decisions about how to present the test could 
affect examinee performance. Seemingly subtle differences in how the test is presented on 
computer could have a not-so-subtle effect on examinee performance. 

Computerized administration is less of an issue for discrete-item tests such as Math, if 
single items can be presented in full on a computer screen. Computerized administration is more 
of an issue for complex tests that contain information that cannot all be displayed on-screen at 
once for an item. For example, a test with long text-based passages is complex if the examinee 
must navigate through the passages to read and find answers to items. A test with text-based 
passages containing multiple figures or tables per passage is complex, particularly if the figures 
and tables need to be compared. As computerized presentation of tests becomes more of a 
reality, it is important to develop an understanding of the presentation choices we make and how 
they can affect an examinee’s performance. Presenting a complex test on computer is not an easy 
task; many decisions need to be made about how best to present the information, so that the 
method of presentation does not interfere with examinee performance on the test. 

Because of potential mode effects, Parshall, Spray, Kalohn, and Davey (2002) suggest 
that testing programs that treat scores across different administration platforms as equivalent 
should perform studies to document the comparability of the test scores. Mode effects also are 
an important consideration if items are calibrated in one medium, and then used operationally in 
another medium. For example, when starting a computerized testing program, it may be very 
costly and time-consuming to calibrate the initial pool(s) using data from computer 
administrations of the items. Every item in the pool would have to be administered to a 
sufficient number of examinees in order to calibrate. If the item pool is large, this would require 
a substantial amount of testing that likely cannot be done quickly (or cheaply) via computer 
administration. Thus, a testing program might consider initially using item parameters calibrated
from paper and pencil administrations of the items for operational computer administrations,
until enough data are obtained to calibrate from the computer administrations. Parshall et al.
(2002) caution that item calibrations based on paper and pencil administrations might not 
represent the performance of those same items in a computer administration. 

This paper addresses three questions: 

(1) Do examinees respond to items in the same way across administration modes and 
computer interface variations? 

(2) What are some of the factors that can contribute to mode effects? 

(3) Can item parameters calibrated from paper and pencil administrations be used for 
computer administrations? 

The questions are examined using data from paper and pencil and computer 
administrations of a fixed-form test in two different examinee samples. Several of the tests 
studied were complex. An initial comparability study was performed in 1998. In response to 
findings from that study, revisions were made to the computer interfaces, and then a follow-up 
comparability study was performed in 2000. The paper examines performance differences across 
paper and computer modes and across computer interface variations in both studies. Results are 
summarized at the total test level and for some individual items. Some factors that might have 
contributed to mode differences or affected computer performance in general are discussed. In 
addition, a small simulation study was performed to examine the effect of using item parameters 
calibrated from paper and pencil administrations in a computer administration. 

Description of the Tests, Computer Interfaces, and Comparability Studies 

Two comparability studies were performed in 1998 and 2000, called Comparability 1 and 
Comparability 2, respectively. In each study, the same fixed-form tests were administered across 
paper and pencil and computer modes in the content areas of English, Math, Reading, and 
Science Reasoning. Slightly different Math tests were used across the two comparability studies. 
As such, results from Math are not presented here. An initial computer interface was used in 
Comparability 1 and then modified for Comparability 2 based on findings from Comparability 1.
The interface used in Comparability 1 is referred to as Interface 1. The interface used in
Comparability 2 is referred to as Interface 2. 






English 

Test Content 

The English test consisted of four passages containing underlined words and phrases, 
with 15 multiple-choice items in each passage (60 items total). For most items, examinees were 
instructed to choose the response option for the underlined portion that best expressed the idea, 
made the statement appropriate for standard written English, or was worded most consistently 
with the style and tone of the passage as a whole. These types of items had no stimulus 
associated with them (i.e., there were only response options, and no preceding question). For 
some items, there was a stimulus present that asked a question about the underlined portion in 
the passage. Examinees were instructed to choose the best answer to the question. 

Booklet Presentation 

In the booklet presentation of the English test, the passage and items were presented 
jointly on a page. The passage was presented in the left half of the page, while the items were 
presented in the right half of the page. Each underlined portion was always aligned with the top 
of the item. The passages and accompanying items occupied about two booklet pages each. 
Examinees were able to move freely throughout all English passages and items in the booklet 
while taking the English test. They could respond to items and passages in any order, and were 
not required to give responses to all items. Similar rules of movement between items and 
passages held for the Reading and Science Reasoning paper and pencil tests. Within a single
test, examinees were allowed to move freely throughout the test. 

Computer Presentation 

In the computer presentation for both Interface 1 and Interface 2, the passage and items 
were presented jointly on the screen, with the passage on the left half and the items on the right 
half of the screen. The passage was not visible in its entirety on the computer screen. The 
examinee had to scroll through the passage to see the passage in its entirety, although the passage 
automatically scrolled for examinees on various items (see further discussion below). Items 
were presented one at a time. Within a passage, examinees were allowed to answer items in any 
order. They were required to answer all items prior to moving on to the next passage. Once an 
examinee completed a passage and moved on to the next passage, they were not allowed to 
return to the previous passage. Also, passages were presented one at a time, so that examinees 
could not see the next passage until they proceeded to it. A similar presentation of the passage
and item windows was used with the computerized Reading and Science Reasoning tests, along
with the same rules for moving between items and passages. 

Computer Interface Features 

In Interface 1, the following features were utilized: 

• The full underlined portion in the passage window was highlighted. 

• The passage automatically scrolled when an item was selected that was not visible on screen
(about every 6th item).

• The underlined portions were not aligned with the top of the question. 

Results from Comparability 1 (to be discussed in more detail later) showed that whereas 
some individual items favored computer examinees and some individual items favored paper 
examinees, as a whole, the test tended to favor computer examinees. After a review of the test 
content, test booklet, computer interface, and discussions with examinees, the following 
hypotheses were posited as possible explanations for why computer examinees performed better 
overall than paper examinees (see Pommerich & Burden (2000) for further discussion): 

• The use of full highlighting was advantageous to computer examinees as it drew their 
attention to the full underlined portion. 

• Computer examinees on some items were better able to focus on relevant sections of 
passages/items because those sections were centered in the passage and item windows and 
examinees were not distracted by extraneous information presented in the rest of the test. 
This phenomenon will be referred to as the “focus effect.” 

Results also suggested the following hypotheses: 

• Computer examinees might have been less likely to read the stimulus preceding the response 
options, for items containing a stimulus. 

• Where the underlined portion was aligned with the response options might influence the 
response selected. 

Thus, the following changes were implemented for Interface 2: 

• The full highlighting of the underlined portion was removed. Instead, only the item number 
underneath the underlined portion was highlighted. 

• The item number was placed adjacent to the top of the question within the item window to
match what is done in the booklets (in Interface 1, the items were not numbered adjacent to
the question).







• Two automatic scrolling variations were compared: 

• The passage scrolled when an item was selected that was not visible on screen (about 
every 6th item), so that the underlined portion was not always aligned with the top of
the question. This scrolling was used in both Interface 1 and Interface 2. This 
condition will be referred to as English Semi. 

• The passage scrolled every time a new item was selected, so that the underlined 
portion was always aligned with the top of the question. This scrolling was only 
used in Interface 2. This condition will be referred to as English Auto. 
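The difference between the two scrolling conditions can be illustrated with a minimal sketch. The `PassageWindow` class, the line indices, and the window height below are hypothetical and are not taken from the actual test software. The sketch also assumes that whenever a scroll occurs, the underlined portion is placed at the top of the passage window, which the paper does not state for the Semi condition.

```python
from dataclasses import dataclass


@dataclass
class PassageWindow:
    """Hypothetical passage window showing a run of consecutive passage lines."""
    top_line: int       # index of the first visible passage line
    visible_lines: int  # number of passage lines that fit in the window

    def shows(self, line: int) -> bool:
        return self.top_line <= line < self.top_line + self.visible_lines


def select_item(window: PassageWindow, underlined_line: int, mode: str) -> int:
    """Return the top line of the passage window after an item is selected.

    mode="semi": scroll only when the underlined portion is off screen.
    mode="auto": scroll on every selection.
    Assumption: a scroll always puts the underlined portion at the top.
    """
    if mode == "auto" or not window.shows(underlined_line):
        window.top_line = underlined_line
    return window.top_line


def top_lines(mode: str, item_lines: list[int], visible_lines: int = 12) -> list[int]:
    """Select items in order and record the window's top line after each one."""
    window = PassageWindow(top_line=0, visible_lines=visible_lines)
    return [select_item(window, line, mode) for line in item_lines]


# Hypothetical positions of seven underlined portions within a passage.
items = [1, 3, 5, 7, 9, 11, 13]
print(top_lines("semi", items))  # [0, 0, 0, 0, 0, 0, 13]
print(top_lines("auto", items))  # [1, 3, 5, 7, 9, 11, 13]
```

With these made-up positions, the Semi rule leaves the window in place for six items and scrolls only on the seventh, whereas the Auto rule realigns the window on every selection.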

Reading 

Test Content 

The Reading test consisted of four passages with 10 multiple-choice items on each
passage (40 items total). Examinees were instructed to read the passage and choose the best 
answer to each question. Items on the Reading test generally fell into two types: questions that 
required a global understanding of the passage and questions that required knowledge of specific 
information given in the passage. For global questions, examinees typically had to make an 
inference from what they had read to answer the question. Some of the items had line references 
associated with them (i.e., the item stimulus contained the number of a line or lines in the
passage that examinees were directed to read). In the booklet presentation, the reading passage
was presented first in its entirety, in two columns per page. The passages were followed by the 
test items. The passages and accompanying items occupied about two booklet pages each. The 
computer presentation for Reading corresponded to that described for the English test. 

Computer Interface Features 

In Interface 1, the following features were utilized: 

• Examinees moved through the passage by scrolling. 

• Examinees could scroll line-by-line, or use a sliding scroll bar to move quickly through the 
passage. 

• Line-by-line scrolling speed was not very fast. 

• Pre-test training for scrolling options was for line-by-line scrolling only (examinees were not 
explicitly shown how to use the sliding scroll bar). 

• Line breaks were not the same as in the booklet, so the content of referenced lines was not 
the same across modes. 
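The line-break issue in the last point can be shown with a small sketch. The sample sentence and the two wrap widths are invented for illustration and do not come from the test materials. The sketch uses Python's standard `textwrap` module to show that the same text wrapped at two different widths puts different content on "line 2".

```python
import textwrap


def line_content(text: str, width: int, line_number: int) -> str:
    """Return the 1-indexed line of `text` after wrapping at `width` characters."""
    return textwrap.wrap(text, width=width)[line_number - 1]


# Hypothetical passage text; the widths stand in for booklet and screen layouts.
passage = (
    "The quick brown fox jumps over the lazy dog near the quiet river bank at dawn."
)

booklet_line_2 = line_content(passage, width=30, line_number=2)
screen_line_2 = line_content(passage, width=40, line_number=2)

print(booklet_line_2)
print(screen_line_2)
print(booklet_line_2 == screen_line_2)  # False: a reference to "line 2" differs by mode
```

An item that directs examinees to a numbered line would therefore point to different words in each layout unless the line breaks are matched.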







Results from Comparability 1 (to be discussed in more detail later) showed that whereas 
some individual items favored computer examinees and some individual items favored paper 
examinees, as a whole, the test tended to favor paper examinees. After a review of the test 
content, test booklet, computer interface, and interviews with examinees, the following 
hypotheses were posited as possible explanations for why paper examinees performed better 
overall than computer examinees (see Pommerich & Burden (2000) for further discussion): 

• Computer examinees sometimes had difficulty locating information in the passage with 
scrolling as the navigation method. 

• Paper examinees might have been more likely than computer examinees to experience 
“positional memory,” whereby they remembered the location of information given in the 
passage, because the passage occurred in a fixed position on the page. 

• Slow scrolling speed was a hindrance for computer examinees. 

• Different line breaks could have created mode differences on questions with line references.

Thus, the following changes were implemented for Interface 2:

• Line breaks for the passages were made the same across booklet and computer 
representations, so that each line contained the same content across modes. 

• Scrolling speed was increased. 

• Examinees were explicitly taught to use the sliding scroll bar prior to testing. 

• Two navigation variations were compared: 

• Examinees moved through the passage by scrolling, using either line-by-line scrolling 
or a sliding scroll bar. This scrolling was used in Interface 1 and Interface 2 
(although scrolling speed was increased and pre-test instruction on scrolling was 
more comprehensive for Interface 2). This condition will be referred to as Read Scroll.

• Examinees moved through the passage by paging. In this variation, the passage was 
divided into separate pages and the examinee moved between pages by clicking on a 
specific page number, or by using “Next Page” or “Previous Page” buttons. Paging 
was only used in Interface 2. This condition will be referred to as Read Page. 
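The paging variation can be sketched as follows. The `PagedPassage` class, its method names, and the number of lines per page are hypothetical; the paper does not describe how the interface was implemented or how many lines each page held. The sketch only illustrates dividing a passage into fixed pages and moving between them by page number or by "Next Page" and "Previous Page" actions.

```python
class PagedPassage:
    """Hypothetical paged view of a passage, split into fixed-size pages."""

    def __init__(self, lines: list[str], lines_per_page: int) -> None:
        # Each page holds a fixed slice of the passage, so a given line
        # always appears on the same page.
        self.pages = [
            lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)
        ]
        self.current = 0  # zero-based index of the page being shown

    def go_to(self, page_number: int) -> list[str]:
        """Jump to a 1-indexed page number; ignore out-of-range requests."""
        if 1 <= page_number <= len(self.pages):
            self.current = page_number - 1
        return self.pages[self.current]

    def next_page(self) -> list[str]:
        return self.go_to(self.current + 2)

    def previous_page(self) -> list[str]:
        return self.go_to(self.current)


passage = PagedPassage([f"line {i}" for i in range(1, 11)], lines_per_page=4)
print(passage.next_page())      # ['line 5', 'line 6', 'line 7', 'line 8']
print(passage.go_to(3))         # ['line 9', 'line 10']
print(passage.next_page())      # still page 3: ['line 9', 'line 10']
print(passage.previous_page())  # ['line 5', 'line 6', 'line 7', 'line 8']
```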







Science Reasoning 

Test Content 

The Science Reasoning test consisted of seven passages with varying numbers of multiple-choice
items per passage (5-7 items per passage; 40 items total). Some passages contained
figures and tables. In the booklet presentation, the passage was presented first in its entirety, 
in two columns per page. The passages and accompanying items occupied about two booklet 
pages each. The passages were followed by the test items. The computer presentation for 
Science Reasoning corresponded to that described for the English test, with the additional 
feature that some figures and tables within the passage were enlargeable and moveable. 

Computer Interface Features 

In Interface 1, the following features were utilized: 

• Examinees moved through the passage by scrolling. 

• Examinees could scroll line-by-line, or use a sliding scroll bar to move quickly through the 
passage. 

• Line-by-line scrolling speed was not very fast. 

• Pre-test training for scrolling options was for line-by-line scrolling only (examinees were not 
explicitly shown how to use the sliding scroll bar). 

Results from Comparability 1 (to be discussed in more detail later) showed some 
individual items favoring computer examinees and some individual items favoring paper 
examinees. Overall, there was no trend in results, although the last passage (Passage 7) favored 
paper examinees, and Passage 4 favored computer examinees. After a review of the test content, 
test booklet, computer interface, and interviews with examinees, the following hypotheses were 
posited as possible explanations for why computer and paper examinees performed differently on 
individual items/passages (see Pommerich & Burden (2000) for further discussion): 

• Computer examinees sometimes had difficulty locating information given in the passage with 
scrolling as the navigation method. 

• Paper examinees might have been more likely than computer examinees to experience 
“positional memory,” whereby they remembered the location of information in the passage, 
because the passage occurred in a fixed position on the page. 

• Slow scrolling speed was a hindrance for computer examinees. 









• Computer examinees had difficulty comparing information across two tables or figures, and 
were unaware that they could move an enlarged graphic so that it could be viewed at the 
same time as another graphic. 

• Computer examinees were advantaged by a “focus effect” on some items (i.e., they were 
better able to focus on relevant sections of passages/items because those sections were 
centered in the passage and item windows and examinees were not distracted by extraneous 
information presented in the rest of the test). 

Thus, the following changes were implemented for Interface 2: 

• Scrolling speed was increased. 

• Examinees were explicitly taught to use the sliding scroll bar prior to testing. 

• Two graphics were allowed to be enlarged simultaneously and moved so they could be 
viewed side-by-side. 

• Two navigation variations were compared: 

• Examinees moved through the passage by scrolling, using either line-by-line
scrolling or a sliding scroll bar. This scrolling was used in Interface 1 and Interface 2
(although scrolling speed was increased and pre-test instruction on scrolling was
more comprehensive in Interface 2). This condition will be referred to as Science Scroll.

• Examinees moved through the passage by paging. In this variation, the passage was 
divided into separate pages and the examinee moved between pages by clicking on a 
specific page number, or by using “Next Page” or “Previous Page” buttons. Paging 
was only used in Interface 2. This condition will be referred to as Science Page. 

Changes Between Interface 1 and Interface 2 for All Tests

The following changes were implemented between Interface 1 and Interface 2 in the same way across
all tests, unless otherwise indicated:

• The wording on some buttons and on text adjacent to the buttons was changed to be more
concise and clear.

• Different colors and button designs were used to change the look of the interface. 

• Additional passage and item numbering was added to the outside of the passage and item
windows, to clarify which item/passage the examinee was on (e.g., Passage 1 of 4,
Question 1 of 60). In Interface 1, the current passage and question numbers were given, but
no information was given as to how many passages or questions remained.

• On startup of a passage, the first question was not displayed until the examinee selected the 
first item, to encourage examinees to read the passage before answering the first question.

The Comparability Studies

Comparability 1

Comparability 1 compared performance across computer and paper and pencil 
administrations of the same fixed form, using computer Interface 1. Testing was conducted
between September and December in 1998. A total of 40 schools participated in the study, with
approximately 8,600 students testing overall. Within a school, examinees were randomly
assigned to a paper and pencil or computer administration of a fixed-form test. Within each 
administration mode, examinees were randomly assigned to one of the following content areas: 
English, Math, Reading, or Science Reasoning. (Note that only one computer interface variation 
was used in each content area.) Thus, there were a total of eight administration conditions. All 
computer examinees took a short tutorial prior to testing that demonstrated how to use all of the 
functions necessary to take the computerized test (with the exception of demonstrating the use of 
the sliding scroll bar, as discussed earlier). The fixed forms were drawn from an intact paper and 
pencil form. Reading and Science Reasoning were administered in their entirety with the same 
time constraints as used operationally, while a representative subset of items was selected from 
the English and Math tests to accommodate a 35-minute testing period. Total testing time was 
35 minutes for all content areas and modes. 
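The two-stage assignment described above can be sketched as follows. The paper does not state the randomization mechanism, so the independent uniform draws, the function name, and the seed are assumptions made purely for illustration. The sketch shows how crossing two modes with four content areas yields the eight administration conditions.

```python
import random
from itertools import product

MODES = ["paper", "computer"]
CONTENT_AREAS = ["English", "Math", "Reading", "Science Reasoning"]

# Two modes crossed with four content areas give eight conditions.
CONDITIONS = list(product(MODES, CONTENT_AREAS))


def assign_examinees(examinee_ids, rng: random.Random) -> dict:
    """Assign each examinee in one school to a mode, then to a content area.

    Assumption: simple independent random draws; the study may have used a
    different scheme to balance group sizes.
    """
    assignments = {}
    for examinee in examinee_ids:
        mode = rng.choice(MODES)             # stage 1: administration mode
        content = rng.choice(CONTENT_AREAS)  # stage 2: content area within mode
        assignments[examinee] = (mode, content)
    return assignments


rng = random.Random(1998)  # arbitrary seed so the sketch is reproducible
school = assign_examinees(range(200), rng)
print(len(CONDITIONS))  # 8
```

Because assignment is random within each school, differences between schools should not be confounded with administration mode or content area.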

Comparability 2 

Comparability 2 compared performance across computer and paper and pencil 
administrations of the same fixed form, using computer Interface 2. Interface 2 was a modified 
version of Interface 1, developed in response to findings from Comparability 1. Testing was 
conducted between October 2000 and January 2001. A total of 61 schools participated in the 
study, with approximately 12,000 examinees testing. Within a school, examinees were randomly 
assigned to a paper and pencil or computer administration of a fixed-form test. Examinees 
assigned to the paper mode were randomly assigned to one of the following content areas: 
English, Math, Reading, or Science Reasoning. Examinees assigned to the computer mode were 
randomly assigned to one of the following content area and interface variations: English Auto, 






